Overview

Dataset statistics

Number of variables: 14
Number of observations: 3790
Missing cells: 3956
Missing cells (%): 7.5%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 414.7 KiB
Average record size in memory: 112.0 B
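
The missing-cell figures can be cross-checked against the per-variable counts reported further down in this profile. The sketch below is only an arithmetic check: the dictionary is transcribed by hand from the variable sections (the name `missing_per_variable` is made up for illustration), and all variables not listed are taken to have no missing values, as their sections state.

```python
# Missing counts per variable, transcribed from the variable sections of
# this report. Variables not listed here report 0 missing values.
missing_per_variable = {
    "date_of_establishment": 2040,
    "deposit_amount_2011": 740,
    "deposit_amount_2012": 578,
    "deposit_amount_2013": 329,
    "deposit_amount_2014": 175,
    "deposit_amount_2015": 56,
    "deposit_amount_2016": 19,
    "deposit_amount_2017": 19,
}

n_variables = 14
n_observations = 3790

# Every cell of the table counts, so the denominator is rows x columns.
total_cells = n_variables * n_observations
missing_cells = sum(missing_per_variable.values())
missing_pct = 100 * missing_cells / total_cells

print(missing_cells, round(missing_pct, 1))  # 3956 7.5
```

The total and the percentage agree with the values in the overview.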

Variable types

NUM: 9
CAT: 4
BOOL: 1

Reproduction

Analysis started: 2020-07-10 12:38:32.755950
Analysis finished: 2020-07-10 12:38:48.603436
Duration: 15.85 seconds
Version: pandas-profiling v2.8.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml

Warnings

headquarter has constant value "0" Constant
date_of_establishment has a high cardinality: 843 distinct values High cardinality
location has a high cardinality: 1489 distinct values High cardinality
loc.details has a high cardinality: 299 distinct values High cardinality
location.Code is highly correlated with id High correlation
id is highly correlated with location.Code High correlation
deposit_amount_2012 is highly correlated with deposit_amount_2011 and 5 other fields High correlation
deposit_amount_2011 is highly correlated with deposit_amount_2012 and 5 other fields High correlation
deposit_amount_2013 is highly correlated with deposit_amount_2011 and 5 other fields High correlation
deposit_amount_2014 is highly correlated with deposit_amount_2011 and 5 other fields High correlation
deposit_amount_2015 is highly correlated with deposit_amount_2011 and 5 other fields High correlation
deposit_amount_2016 is highly correlated with deposit_amount_2011 and 5 other fields High correlation
deposit_amount_2017 is highly correlated with deposit_amount_2011 and 5 other fields High correlation
date_of_establishment has 2040 (53.8%) missing values Missing
deposit_amount_2011 has 740 (19.5%) missing values Missing
deposit_amount_2012 has 578 (15.3%) missing values Missing
deposit_amount_2013 has 329 (8.7%) missing values Missing
deposit_amount_2014 has 175 (4.6%) missing values Missing
deposit_amount_2015 has 56 (1.5%) missing values Missing
deposit_amount_2011 is highly skewed (γ1 = 54.23092623) Skewed
deposit_amount_2012 is highly skewed (γ1 = 55.80428776) Skewed
deposit_amount_2013 is highly skewed (γ1 = 57.73524155) Skewed
deposit_amount_2014 is highly skewed (γ1 = 58.94451219) Skewed
deposit_amount_2015 is highly skewed (γ1 = 59.86730287) Skewed
deposit_amount_2016 is highly skewed (γ1 = 60.28538584) Skewed
deposit_amount_2017 is highly skewed (γ1 = 60.28538584) Skewed
id has unique values Unique
location.Code has unique values Unique
deposit_amount_2011 has 47 (1.2%) zeros Zeros
deposit_amount_2012 has 45 (1.2%) zeros Zeros
deposit_amount_2013 has 49 (1.3%) zeros Zeros
deposit_amount_2014 has 50 (1.3%) zeros Zeros
deposit_amount_2015 has 51 (1.3%) zeros Zeros
deposit_amount_2016 has 50 (1.3%) zeros Zeros
deposit_amount_2017 has 50 (1.3%) zeros Zeros

Variables

id
Real number (ℝ≥0)

HIGH CORRELATION
UNIQUE

Distinct count: 3790
Unique (%): 100.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1895.5
Minimum: 1
Maximum: 3790
Zeros: 0
Zeros (%): 0.0%
Memory size: 29.6 KiB

Quantile statistics

Minimum: 1
5-th percentile: 190.45
Q1: 948.25
Median: 1895.5
Q3: 2842.75
95-th percentile: 3600.55
Maximum: 3790
Range: 3789
Interquartile range (IQR): 1894.5
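
The fractional percentiles for an integer column follow from interpolating between neighbouring sorted values. The sketch below assumes that `id` is exactly the sequence 1..3790 (consistent with the distinct count, minimum and maximum above, but still an assumption) and that quantiles use linear interpolation, which is the pandas default. The helper `quantile_linear` is written here for illustration and is not part of any library.

```python
def quantile_linear(sorted_values, q):
    """Quantile q (0..1) with linear interpolation between order statistics."""
    position = q * (len(sorted_values) - 1)   # fractional zero-based index
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (
        sorted_values[upper] - sorted_values[lower]
    )

# Assumption: the id column is the integers 1..3790.
ids = list(range(1, 3791))

p05 = quantile_linear(ids, 0.05)
q1 = quantile_linear(ids, 0.25)
median = quantile_linear(ids, 0.50)
q3 = quantile_linear(ids, 0.75)
p95 = quantile_linear(ids, 0.95)
iqr = q3 - q1

print(round(p05, 2), q1, median, q3, round(p95, 2), iqr)
```

Under these assumptions the results agree with the quantile statistics listed above.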

Descriptive statistics

Standard deviation: 1094.223088
Coefficient of variation (CV): 0.5772741167
Kurtosis: -1.2
Mean: 1895.5
Median Absolute Deviation (MAD): 947.5
Skewness: 0
Sum: 7183945
Variance: 1197324.167
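
Several of these descriptive statistics can be reproduced with the standard library. This is a sketch under the assumption that `id` is exactly 1..3790; it uses the sample (n - 1) variance, which is what the reported value is consistent with, and computes the MAD as the median of absolute deviations from the median.

```python
import statistics

# Assumption: the id column is the integers 1..3790.
ids = list(range(1, 3791))

total = sum(ids)
mean = statistics.mean(ids)
variance = statistics.variance(ids)   # sample variance, divides by n - 1
std = statistics.stdev(ids)           # square root of the sample variance
cv = std / mean                       # coefficient of variation
median = statistics.median(ids)
mad = statistics.median([abs(x - median) for x in ids])

print(total, mean, round(variance, 3), round(std, 6), round(cv, 10), mad)
```

For the integers 1..n the sample variance equals n(n + 1)/12, which for n = 3790 gives about 1197324.167, in line with the value above.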
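Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
2047 | 1 | < 0.1%
621 | 1 | < 0.1%
645 | 1 | < 0.1%
2692 | 1 | < 0.1%
641 | 1 | < 0.1%
2688 | 1 | < 0.1%
637 | 1 | < 0.1%
2684 | 1 | < 0.1%
633 | 1 | < 0.1%
2680 | 1 | < 0.1%
Other values (3780) | 3780 | 99.7%

Minimum 5 values

Value | Count | Frequency (%)
1 | 1 | < 0.1%
2 | 1 | < 0.1%
3 | 1 | < 0.1%
4 | 1 | < 0.1%
5 | 1 | < 0.1%

Maximum 5 values

Value | Count | Frequency (%)
3790 | 1 | < 0.1%
3789 | 1 | < 0.1%
3788 | 1 | < 0.1%
3787 | 1 | < 0.1%
3786 | 1 | < 0.1%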

headquarter
Boolean

CONSTANT
REJECTED

Distinct count: 1
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 29.6 KiB

Common values

Value | Count | Frequency (%)
0 | 3790 | 100.0%

location.Code
Real number (ℝ≥0)

HIGH CORRELATION
UNIQUE

Distinct count: 3790
Unique (%): 100.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 5395.806332453826
Minimum: 2871
Maximum: 7994
Zeros: 0
Zeros (%): 0.0%
Memory size: 29.6 KiB

Quantile statistics

Minimum: 2871
5-th percentile: 3078.45
Q1: 4067.25
Median: 5261.5
Q3: 6863.25
95-th percentile: 7779.55
Maximum: 7994
Range: 5123
Interquartile range (IQR): 2796

Descriptive statistics

Standard deviation: 1549.105135
Coefficient of variation (CV): 0.2870942802
Kurtosis: -1.284903617
Mean: 5395.806332
Median Absolute Deviation (MAD): 1409
Skewness: 0.09235452196
Sum: 20450106
Variance: 2399726.72

Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
4094 | 1 | < 0.1%
2884 | 1 | < 0.1%
2900 | 1 | < 0.1%
4947 | 1 | < 0.1%
6994 | 1 | < 0.1%
2896 | 1 | < 0.1%
4943 | 1 | < 0.1%
6990 | 1 | < 0.1%
2892 | 1 | < 0.1%
4939 | 1 | < 0.1%
Other values (3780) | 3780 | 99.7%

Minimum 5 values

Value | Count | Frequency (%)
2871 | 1 | < 0.1%
2872 | 1 | < 0.1%
2873 | 1 | < 0.1%
2874 | 1 | < 0.1%
2875 | 1 | < 0.1%

Maximum 5 values

Value | Count | Frequency (%)
7994 | 1 | < 0.1%
7993 | 1 | < 0.1%
7989 | 1 | < 0.1%
7987 | 1 | < 0.1%
7986 | 1 | < 0.1%

date_of_establishment
Categorical

HIGH CARDINALITY
MISSING

Distinct count: 843
Unique (%): 48.2%
Missing: 2040
Missing (%): 53.8%
Memory size: 29.6 KiB

Common values

Value | Count | Frequency (%)
1920-01-01 | 129 | 3.4%
1890-01-01 | 111 | 2.9%
1966-05-05 | 32 | 0.8%
1935-01-11 | 30 | 0.8%
2004-01-07 | 27 | 0.7%
1924-01-01 | 27 | 0.7%
1935-01-07 | 26 | 0.7%
1934-01-12 | 17 | 0.4%
1908-01-01 | 14 | 0.4%
1916-12-31 | 14 | 0.4%
Other values (833) | 1323 | 34.9%
(Missing) | 2040 | 53.8%

Length

Max length: 10
Median length: 3
Mean length: 6.232189974
Min length: 3

location
Categorical

HIGH CARDINALITY

Distinct count: 1489
Unique (%): 39.3%
Missing: 0
Missing (%): 0.0%
Memory size: 29.6 KiB

Common values

Value | Count | Frequency (%)
Chicago | 99 | 2.6%
New York City | 76 | 2.0%
Houston | 72 | 1.9%
Los Angeles | 58 | 1.5%
Indianapolis | 49 | 1.3%
Miami | 44 | 1.2%
San Francisco | 42 | 1.1%
Brooklyn | 40 | 1.1%
Seattle | 36 | 0.9%
San Diego | 35 | 0.9%
Other values (1479) | 3239 | 85.5%

Length

Max length: 22
Median length: 9
Mean length: 9.158047493
Min length: 4

loc.details
Categorical

HIGH CARDINALITY

Distinct count: 299
Unique (%): 7.9%
Missing: 0
Missing (%): 0.0%
Memory size: 29.6 KiB

Common values

Value | Count | Frequency (%)
Los Angeles | 298 | 7.9%
Cook | 159 | 4.2%
Orange | 149 | 3.9%
Harris | 102 | 2.7%
San Diego | 92 | 2.4%
Maricopa | 90 | 2.4%
King | 86 | 2.3%
Miami-Dade | 82 | 2.2%
New York | 76 | 2.0%
Clark | 71 | 1.9%
Other values (289) | 2585 | 68.2%

Length

Max length: 20
Median length: 7
Mean length: 7.517150396
Min length: 3

state
Categorical

Distinct count: 25
Unique (%): 0.7%
Missing: 0
Missing (%): 0.0%
Memory size: 29.6 KiB

Common values

Value | Count | Frequency (%)
CA | 1003 | 26.5%
NY | 425 | 11.2%
FL | 391 | 10.3%
TX | 378 | 10.0%
IL | 252 | 6.6%
WA | 204 | 5.4%
NJ | 179 | 4.7%
IN | 177 | 4.7%
CO | 114 | 3.0%
OR | 113 | 3.0%
Other values (15) | 554 | 14.6%

Length

Max length: 2
Median length: 2
Mean length: 2
Min length: 2

deposit_amount_2011
Real number (ℝ≥0)

HIGH CORRELATION
MISSING
SKEWED
ZEROS

Distinct count: 2955
Unique (%): 96.9%
Missing: 740
Missing (%): 19.5%
Infinite: 0
Infinite (%): 0.0%
Mean: 168320.07688524592
Minimum: 0.0
Maximum: 230365992.0
Zeros: 47
Zeros (%): 1.2%
Memory size: 29.6 KiB

Quantile statistics

Minimum: 0
5-th percentile: 7801.65
Q1: 28398
Median: 53442
Q3: 99109.125
95-th percentile: 222504.9
Maximum: 230365992
Range: 230365992
Interquartile range (IQR): 70711.125

Descriptive statistics

Standard deviation: 4196386.456
Coefficient of variation (CV): 24.9309918
Kurtosis: 2973.048844
Mean: 168320.0769
Median Absolute Deviation (MAD): 30817.5
Skewness: 54.23092623
Sum: 513376234.5
Variance: 1.760965929e+13

Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
0 | 47 | 1.2%
25234.5 | 3 | 0.1%
46299 | 2 | 0.1%
9663 | 2 | 0.1%
10548 | 2 | 0.1%
40938 | 2 | 0.1%
179181 | 2 | 0.1%
49932 | 2 | 0.1%
28243.5 | 2 | 0.1%
55513.5 | 2 | 0.1%
Other values (2945) | 2984 | 78.7%
(Missing) | 740 | 19.5%

Minimum 5 values

Value | Count | Frequency (%)
0 | 47 | 1.2%
156 | 1 | < 0.1%
172.5 | 1 | < 0.1%
274.5 | 1 | < 0.1%
562.5 | 1 | < 0.1%

Maximum 5 values

Value | Count | Frequency (%)
230365992 | 1 | < 0.1%
22813977 | 1 | < 0.1%
7374436.5 | 1 | < 0.1%
5990880 | 1 | < 0.1%
5568664.5 | 1 | < 0.1%

deposit_amount_2012
Real number (ℝ≥0)

HIGH CORRELATION
MISSING
SKEWED
ZEROS

Distinct count: 3104
Unique (%): 96.6%
Missing: 578
Missing (%): 15.3%
Infinite: 0
Infinite (%): 0.0%
Mean: 188270.4662204234
Minimum: 0.0
Maximum: 291582000.0
Zeros: 45
Zeros (%): 1.2%
Memory size: 29.6 KiB

Quantile statistics

Minimum: 0
5-th percentile: 7042.275
Q1: 30199.125
Median: 55774.5
Q3: 100420.5
95-th percentile: 232479.15
Maximum: 291582000
Range: 291582000
Interquartile range (IQR): 70221.375

Descriptive statistics

Standard deviation: 5171072.99
Coefficient of variation (CV): 27.46619315
Kurtosis: 3143.282123
Mean: 188270.4662
Median Absolute Deviation (MAD): 31239.75
Skewness: 55.80428776
Sum: 604724737.5
Variance: 2.673999587e+13

Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
0 | 45 | 1.2%
30319.5 | 3 | 0.1%
68905.5 | 3 | 0.1%
23964 | 2 | 0.1%
12310.5 | 2 | 0.1%
161722.5 | 2 | 0.1%
75295.5 | 2 | 0.1%
22470 | 2 | 0.1%
22938 | 2 | 0.1%
53913 | 2 | 0.1%
Other values (3094) | 3147 | 83.0%
(Missing) | 578 | 15.3%

Minimum 5 values

Value | Count | Frequency (%)
0 | 45 | 1.2%
4.5 | 1 | < 0.1%
117 | 1 | < 0.1%
180 | 1 | < 0.1%
213 | 1 | < 0.1%

Maximum 5 values

Value | Count | Frequency (%)
291582000 | 1 | < 0.1%
26306682 | 1 | < 0.1%
8979693 | 1 | < 0.1%
7189416 | 1 | < 0.1%
6691308 | 1 | < 0.1%

deposit_amount_2013
Real number (ℝ≥0)

HIGH CORRELATION
MISSING
SKEWED
ZEROS

Distinct count: 3356
Unique (%): 97.0%
Missing: 329
Missing (%): 8.7%
Infinite: 0
Infinite (%): 0.0%
Mean: 193380.30424732735
Minimum: 0.0
Maximum: 311051982.0
Zeros: 49
Zeros (%): 1.3%
Memory size: 29.6 KiB

Quantile statistics

Minimum: 0
5-th percentile: 8251.5
Q1: 31597.5
Median: 59616
Q3: 107244
95-th percentile: 251787
Maximum: 311051982
Range: 311051982
Interquartile range (IQR): 75646.5

Descriptive statistics

Standard deviation: 5320718.404
Coefficient of variation (CV): 27.51427259
Kurtosis: 3370.609983
Mean: 193380.3042
Median Absolute Deviation (MAD): 33573
Skewness: 57.73524155
Sum: 669289233
Variance: 2.831004433e+13

Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
0 | 49 | 1.3%
50184 | 2 | 0.1%
62122.5 | 2 | 0.1%
62671.5 | 2 | 0.1%
31689 | 2 | 0.1%
76456.5 | 2 | 0.1%
17676 | 2 | 0.1%
80827.5 | 2 | 0.1%
30180 | 2 | 0.1%
4477.5 | 2 | 0.1%
Other values (3346) | 3394 | 89.6%
(Missing) | 329 | 8.7%

Minimum 5 values

Value | Count | Frequency (%)
0 | 49 | 1.3%
82.5 | 1 | < 0.1%
142.5 | 1 | < 0.1%
156 | 1 | < 0.1%
198 | 1 | < 0.1%

Maximum 5 values

Value | Count | Frequency (%)
311051982 | 1 | < 0.1%
31898808 | 1 | < 0.1%
9417007.5 | 1 | < 0.1%
9172369.5 | 1 | < 0.1%
5943204 | 1 | < 0.1%

deposit_amount_2014
Real number (ℝ≥0)

HIGH CORRELATION
MISSING
SKEWED
ZEROS

Distinct count: 3504
Unique (%): 96.9%
Missing: 175
Missing (%): 4.6%
Infinite: 0
Infinite (%): 0.0%
Mean: 204574.2684647303
Minimum: 0.0
Maximum: 335093029.5
Zeros: 50
Zeros (%): 1.3%
Memory size: 29.6 KiB

Quantile statistics

Minimum: 0
5-th percentile: 10120.05
Q1: 34971.75
Median: 63537
Q3: 114528.75
95-th percentile: 266363.7
Maximum: 335093029.5
Range: 335093029.5
Interquartile range (IQR): 79557

Descriptive statistics

Standard deviation: 5610535.906
Coefficient of variation (CV): 27.42542328
Kurtosis: 3515.557103
Mean: 204574.2685
Median Absolute Deviation (MAD): 34978.5
Skewness: 58.94451219
Sum: 739535980.5
Variance: 3.147811315e+13

Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
0 | 50 | 1.3%
51912 | 3 | 0.1%
52413 | 3 | 0.1%
104863.5 | 2 | 0.1%
31773 | 2 | 0.1%
20734.5 | 2 | 0.1%
105906 | 2 | 0.1%
81633 | 2 | 0.1%
46066.5 | 2 | 0.1%
24723 | 2 | 0.1%
Other values (3494) | 3545 | 93.5%
(Missing) | 175 | 4.6%

Minimum 5 values

Value | Count | Frequency (%)
0 | 50 | 1.3%
108 | 1 | < 0.1%
229.5 | 1 | < 0.1%
274.5 | 1 | < 0.1%
364.5 | 1 | < 0.1%

Maximum 5 values

Value | Count | Frequency (%)
335093029.5 | 1 | < 0.1%
34449870 | 1 | < 0.1%
12502327.5 | 1 | < 0.1%
9670909.5 | 1 | < 0.1%
7888276.5 | 1 | < 0.1%

deposit_amount_2015
Real number (ℝ≥0)

HIGH CORRELATION
MISSING
SKEWED
ZEROS

Distinct count: 3642
Unique (%): 97.5%
Missing: 56
Missing (%): 1.5%
Infinite: 0
Infinite (%): 0.0%
Mean: 218387.40747188003
Minimum: 0.0
Maximum: 362310873.0
Zeros: 51
Zeros (%): 1.3%
Memory size: 29.6 KiB

Quantile statistics

Minimum: 0
5-th percentile: 12103.425
Q1: 39358.5
Median: 70158
Q3: 124944.75
95-th percentile: 284444.025
Maximum: 362310873
Range: 362310873
Interquartile range (IQR): 85586.25

Descriptive statistics

Standard deviation: 5970415.949
Coefficient of variation (CV): 27.33864566
Kurtosis: 3627.436235
Mean: 218387.4075
Median Absolute Deviation (MAD): 37224.75
Skewness: 59.86730287
Sum: 815458579.5
Variance: 3.56458666e+13

Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
0 | 51 | 1.3%
57706.5 | 2 | 0.1%
63385.5 | 2 | 0.1%
40044 | 2 | 0.1%
25114.5 | 2 | 0.1%
49062 | 2 | 0.1%
34950 | 2 | 0.1%
11340 | 2 | 0.1%
70731 | 2 | 0.1%
15379.5 | 2 | 0.1%
Other values (3632) | 3665 | 96.7%
(Missing) | 56 | 1.5%

Minimum 5 values

Value | Count | Frequency (%)
0 | 51 | 1.3%
51 | 1 | < 0.1%
769.5 | 1 | < 0.1%
928.5 | 1 | < 0.1%
1026 | 1 | < 0.1%

Maximum 5 values

Value | Count | Frequency (%)
362310873 | 1 | < 0.1%
39260689.5 | 1 | < 0.1%
10713843 | 1 | < 0.1%
10233148.5 | 1 | < 0.1%
7121397 | 1 | < 0.1%

deposit_amount_2016
Real number (ℝ≥0)

HIGH CORRELATION
SKEWED
ZEROS

Distinct count: 3672
Unique (%): 97.4%
Missing: 19
Missing (%): 0.5%
Infinite: 0
Infinite (%): 0.0%
Mean: 236442.20365950678
Minimum: 0.0
Maximum: 391939125.0
Zeros: 50
Zeros (%): 1.3%
Memory size: 29.6 KiB

Quantile statistics

Minimum: 0
5-th percentile: 17377.5
Q1: 46321.5
Median: 78774
Q3: 137349
95-th percentile: 306828.75
Maximum: 391939125
Range: 391939125
Interquartile range (IQR): 91027.5

Descriptive statistics

Standard deviation: 6422120.282
Coefficient of variation (CV): 27.16148041
Kurtosis: 3674.143162
Mean: 236442.2037
Median Absolute Deviation (MAD): 40014
Skewness: 60.28538584
Sum: 891623550
Variance: 4.124362892e+13

Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
0 | 50 | 1.3%
21331.5 | 2 | 0.1%
109528.5 | 2 | 0.1%
35719.5 | 2 | 0.1%
32496 | 2 | 0.1%
86626.5 | 2 | 0.1%
34861.5 | 2 | 0.1%
76015.5 | 2 | 0.1%
43893 | 2 | 0.1%
110755.5 | 2 | 0.1%
Other values (3662) | 3703 | 97.7%
(Missing) | 19 | 0.5%

Minimum 5 values

Value | Count | Frequency (%)
0 | 50 | 1.3%
378 | 1 | < 0.1%
598.5 | 1 | < 0.1%
985.5 | 1 | < 0.1%
1213.5 | 1 | < 0.1%

Maximum 5 values

Value | Count | Frequency (%)
391939125 | 1 | < 0.1%
40416823.5 | 1 | < 0.1%
10373826 | 1 | < 0.1%
10013170.5 | 1 | < 0.1%
7904286 | 1 | < 0.1%

deposit_amount_2017
Real number (ℝ≥0)

HIGH CORRELATION
SKEWED
ZEROS

Distinct count: 3672
Unique (%): 97.4%
Missing: 19
Missing (%): 0.5%
Infinite: 0
Infinite (%): 0.0%
Mean: 354663.3054892601
Minimum: 0.0
Maximum: 587908687.5
Zeros: 50
Zeros (%): 1.3%
Memory size: 29.6 KiB

Quantile statistics

Minimum: 0
5-th percentile: 26066.25
Q1: 69482.25
Median: 118161
Q3: 206023.5
95-th percentile: 460243.125
Maximum: 587908687.5
Range: 587908687.5
Interquartile range (IQR): 136541.25

Descriptive statistics

Standard deviation: 9633180.424
Coefficient of variation (CV): 27.16148041
Kurtosis: 3674.143162
Mean: 354663.3055
Median Absolute Deviation (MAD): 60021
Skewness: 60.28538584
Sum: 1337435325
Variance: 9.279816507e+13
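
The CV, kurtosis and skewness reported here are identical to those of deposit_amount_2016, and the reported mean, maximum and quantiles are 1.5 times the 2016 values. That pattern is what one would expect if this column were a rescaled copy of the 2016 column, although the report itself does not say so. The sketch below only illustrates the underlying property: multiplying every value by a positive constant leaves skewness and CV unchanged. The data are hypothetical toy numbers, and the helper functions are written for this example (the skewness uses one common bias-corrected form).

```python
import math

def sample_skewness(values):
    """Bias-corrected (adjusted Fisher-Pearson) sample skewness."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    g1 = m3 / m2 ** 1.5
    return math.sqrt(n * (n - 1)) / (n - 2) * g1

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return math.sqrt(variance) / mean

# Hypothetical toy data with one large outlier; not taken from the dataset.
base = [0.0, 10.0, 20.0, 40.0, 1000.0]
scaled = [1.5 * x for x in base]

print(sample_skewness(base), sample_skewness(scaled))
print(coefficient_of_variation(base), coefficient_of_variation(scaled))
```

Both pairs of printed values agree up to floating-point rounding, while the mean and standard deviation of `scaled` are 1.5 times those of `base`.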
Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
0 | 50 | 1.3%
42367.5 | 2 | 0.1%
209866.5 | 2 | 0.1%
234477 | 2 | 0.1%
98813.25 | 2 | 0.1%
38684.25 | 2 | 0.1%
99918 | 2 | 0.1%
76396.5 | 2 | 0.1%
49988.25 | 2 | 0.1%
53579.25 | 2 | 0.1%
Other values (3662) | 3703 | 97.7%
(Missing) | 19 | 0.5%

Minimum 5 values

Value | Count | Frequency (%)
0 | 50 | 1.3%
567 | 1 | < 0.1%
897.75 | 1 | < 0.1%
1478.25 | 1 | < 0.1%
1820.25 | 1 | < 0.1%

Maximum 5 values

Value | Count | Frequency (%)
587908687.5 | 1 | < 0.1%
60625235.25 | 1 | < 0.1%
15560739 | 1 | < 0.1%
15019755.75 | 1 | < 0.1%
11856429 | 1 | < 0.1%

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
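
The calculation described above can be sketched in a few lines of plain Python. The function name `pearson_r` and the toy data are made up for illustration; this is not the profiler's own implementation.

```python
import math

def pearson_r(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/(n - 1) factors of the covariance and of the two standard
    # deviations cancel, so plain sums of products are enough.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]         # exact increasing linear relation
ys_shifted = [3.0 * y - 7.0 for y in ys] # change of location and scale
ys_reversed = [-y for y in ys]           # exact decreasing linear relation

print(pearson_r(xs, ys), pearson_r(xs, ys_shifted), pearson_r(xs, ys_reversed))
```

The first two results coincide, which illustrates the invariance under changes of location and scale, and the third has the opposite sign.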

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better than Pearson's r at catching nonlinear monotonic correlations. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
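
A minimal sketch of that recipe follows: convert both variables to ranks, then apply the Pearson formula to the ranks. The helper names and the toy data are illustrative, and ties are given their average rank, which is a common convention assumed here rather than something the report states.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    """Covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def spearman_rho(xs, ys):
    """Pearson's r computed on the rank variables."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [x ** 3 for x in xs]  # monotonic but not linear

print(spearman_rho(xs, ys), pearson_r(xs, ys))
```

For this monotonic but nonlinear relation ρ reaches 1 while Pearson's r stays below 1.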

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the discordant pairs divided by the total number of pairs.
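
The pair-counting definition above translates directly into code. The sketch below implements exactly that formula (often called τ-a), in which tied pairs count as neither concordant nor discordant. Library implementations frequently use a tie-corrected variant, so results may differ when ties are present. The function name and the data are illustrative.

```python
from itertools import combinations

def kendall_tau_a(xs, ys):
    """(concordant - discordant) / total number of pairs."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1   # both variables order the pair the same way
        elif product < 0:
            discordant += 1   # the variables order the pair in opposite ways
    n = len(xs)
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs

xs = [1, 2, 3, 4]
ys = [1, 3, 2, 4]  # one swapped pair out of six

tau = kendall_tau_a(xs, ys)
print(tau)
```

Here five of the six pairs are concordant and one is discordant, giving τ = (5 - 1) / 6.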

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution. Extensive documentation is available from the Phik project.

Missing values

Sample

First rows

index | id | headquarter | location.Code | date_of_establishment | location | loc.details | state | deposit_amount_2011 | deposit_amount_2012 | deposit_amount_2013 | deposit_amount_2014 | deposit_amount_2015 | deposit_amount_2016 | deposit_amount_2017
0 | 1 | 0 | 2871 | 1911-06-02 | Wales | Waukesha | WI | 32079.0 | 35971.5 | 37237.5 | 40362.0 | 46021.5 | 46020.0 | 69030.00
1 | 2 | 0 | 2872 | NaN | Germantown | Washington | WI | 83181.0 | 84846.0 | 97098.0 | 110284.5 | 122035.5 | 133905.0 | 200857.50
2 | 3 | 0 | 2873 | 1908-06-04 | Brookfield | Waukesha | WI | 136323.0 | 156450.0 | 187557.0 | 188859.0 | 198751.5 | 206044.5 | 309066.75
3 | 4 | 0 | 2874 | NaN | Pewaukee | Waukesha | WI | 68511.0 | 73932.0 | 79876.5 | 105603.0 | 112113.0 | 110755.5 | 166133.25
4 | 5 | 0 | 2875 | NaN | Waukesha | Waukesha | WI | 96271.5 | 108325.5 | 104880.0 | 121054.5 | 113956.5 | 109837.5 | 164756.25
5 | 6 | 0 | 2876 | NaN | Waukesha | Waukesha | WI | 93837.0 | 101592.0 | 118270.5 | 140280.0 | 150987.0 | 168742.5 | 253113.75
6 | 7 | 0 | 2877 | NaN | Brookfield | Waukesha | WI | 117655.5 | 130725.0 | 153216.0 | 179154.0 | 199660.5 | 214266.0 | 321399.00
7 | 8 | 0 | 2878 | 1961-04-01 | New Berlin | Waukesha | WI | 126933.0 | 144072.0 | 155919.0 | 164754.0 | 181075.5 | 184749.0 | 277123.50
8 | 9 | 0 | 2879 | 1933-02-05 | Oconomowoc | Waukesha | WI | 72700.5 | 73044.0 | 82053.0 | 85413.0 | 83767.5 | 87390.0 | 131085.00
9 | 10 | 0 | 2880 | NaN | Butler | Waukesha | WI | 73921.5 | 73033.5 | 73011.0 | 78331.5 | 80385.0 | 83619.0 | 125428.50

Last rows

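index | id | headquarter | location.Code | date_of_establishment | location | loc.details | state | deposit_amount_2011 | deposit_amount_2012 | deposit_amount_2013 | deposit_amount_2014 | deposit_amount_2015 | deposit_amount_2016 | deposit_amount_2017
3780 | 3781 | 0 | 7980 | 2016-03-11 | Laguna Niguel | Orange | CA | NaN | NaN | NaN | NaN | NaN | NaN | NaN
3781 | 3782 | 0 | 7981 | 2016-07-11 | Oklahoma City | Oklahoma | OK | NaN | NaN | NaN | NaN | NaN | NaN | NaN
3782 | 3783 | 0 | 7983 | NaN | Tampa | Hillsborough | FL | NaN | NaN | NaN | NaN | NaN | NaN | NaN
3783 | 3784 | 0 | 7984 | 2017-10-05 | Jacksonville | Duval | FL | NaN | NaN | NaN | NaN | NaN | NaN | NaN
3784 | 3785 | 0 | 7985 | NaN | Woodland Hills | Los Angeles | CA | NaN | NaN | NaN | NaN | NaN | NaN | NaN
3785 | 3786 | 0 | 7986 | 2016-03-10 | Compton | Los Angeles | CA | NaN | NaN | NaN | NaN | NaN | NaN | NaN
3786 | 3787 | 0 | 7987 | 2017-02-01 | Las Vegas | Clark | NV | NaN | NaN | NaN | NaN | NaN | NaN | NaN
3787 | 3788 | 0 | 7989 | NaN | Irvine | Orange | CA | NaN | NaN | NaN | NaN | NaN | NaN | NaN
3788 | 3789 | 0 | 7993 | 2016-12-31 | New Orleans | Orleans | LA | NaN | NaN | NaN | NaN | NaN | NaN | NaN
3789 | 3790 | 0 | 7994 | 2016-12-31 | Buffalo | Erie | NY | NaN | NaN | NaN | NaN | NaN | NaN | NaN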